Modernize the threaded assembly howto #1068

Merged · 1 commit · Sep 26, 2024
Conversation

@fredrikekre (Member)

No description provided.

codecov bot commented Sep 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.73%. Comparing base (c04c5ba) to head (0e9306c).
Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1068   +/-   ##
=======================================
  Coverage   93.72%   93.73%           
=======================================
  Files          39       39           
  Lines        6011     6017    +6     
=======================================
+ Hits         5634     5640    +6     
  Misses        377      377           


@termi-official (Member)

I get NaNs every now and then:

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.163506 seconds (141.79 k allocations: 20.707 MiB, 0.02% compilation time)
(1.759428561238959e13, 0.10506179512844058)

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.159455 seconds (141.79 k allocations: 20.707 MiB, 0.01% compilation time)
(1.759428561238959e13, 0.10506179512844058)

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.158596 seconds (141.79 k allocations: 20.707 MiB, 0.02% compilation time)
(NaN, NaN)

@fredrikekre (Member Author)

Try after fa2f1d2? Curious that it didn't always give trash data though...

@termi-official (Member)

Saw the fix, and I am also very confused that it worked in the first place (or why it only failed occasionally).

Do you have access to some cluster node (or @KnutAM?) to check scaling? Or some other machine built to scale in the number of threads? Ideally with frequency scaling turned off. We should also check pinning threads to cores so we can give some recommendations here.
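
For reference, pinning can be done with ThreadPinning.jl; a minimal sketch (the pinthreads(:cores) call is the same one used for the benchmark further down in this thread):

using ThreadPinning

# Pin each Julia thread to its own physical core so that timings are not
# distorted by thread migration or by two threads sharing one core.
pinthreads(:cores)

# Optional: inspect the resulting thread-to-core mapping.
threadinfo()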

@termi-official (Member)

The issue remains on head:

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.013251 seconds (10.54 k allocations: 5.768 MiB, 0.04% compilation time)
(NaN, NaN)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.005826 seconds (10.54 k allocations: 5.768 MiB, 0.08% compilation time)
(4.760410118115698e12, 0.25328245103046515)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.006254 seconds (10.54 k allocations: 5.768 MiB, 0.08% compilation time)
(4.760410118115698e12, 0.25328245103046515)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.008274 seconds (10.54 k allocations: 5.768 MiB, 0.06% compilation time)
(NaN, NaN)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.005915 seconds (10.54 k allocations: 5.768 MiB, 0.08% compilation time)
(NaN, NaN)

It seems to only happen with 8 and 16 threads though (on an 8-core machine with hyperthreading).

@KnutAM linked an issue on Sep 25, 2024 that may be closed by this pull request.
@KnutAM (Member) commented on Sep 25, 2024

Great to get this issue fixed!
Just peeked, and noticed an overwrite of cellvalues:

            @local scratch = ScratchData(dh, K, f, cellvalues)
            (; cell_cache, cellvalues, Ke, fe, assembler) = scratch
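
Presumably the destructuring rebinds cellvalues, so the variable captured by the @local initializer is shared between tasks and two tasks can end up using the same CellValues object. A minimal sketch of the fix, assuming the cellvalues_tmp rename that appears later in this diff:

            ## Give the shared template a distinct name so that unpacking the
            ## scratch cannot rebind the variable captured by `@local`:
            @local scratch = ScratchData(dh, K, f, cellvalues_tmp)
            (; cell_cache, cellvalues, Ke, fe, assembler) = scratch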

@fredrikekre (Member Author)

Ah, nice find. That probably explains the large number of allocations too.

@termi-official (Member)

The NaN issue seems to be gone now. However, it only removed about a third of the allocations.

@fredrikekre (Member Author)

For me it reduced them significantly (see the last commit).

@lijas (Collaborator) left a comment

These are my timings on the cluster with n=30:

Run with nthreads: 1
  8.996648 seconds (902 allocations: 816.172 KiB, 0.00% compilation time)
-----
Run with nthreads: 2
  4.957708 seconds (1.64 k allocations: 1.564 MiB, 0.00% compilation time)
-----
Run with nthreads: 4
  2.547859 seconds (3.12 k allocations: 3.099 MiB, 0.00% compilation time)
-----
Run with nthreads: 8
  1.555725 seconds (6.07 k allocations: 6.168 MiB, 0.00% compilation time)
-----
Run with nthreads: 16
  0.951255 seconds (11.97 k allocations: 12.306 MiB, 0.00% compilation time)
-----
Run with nthreads: 32
  0.614075 seconds (23.78 k allocations: 24.582 MiB, 0.01% compilation time)
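
For reference, that is a speedup of roughly 9.0 s / 0.61 s ≈ 14.7× at 32 threads, i.e. about 46% parallel efficiency.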

cell_cache = CellCache(dh)
n = ndofs_per_cell(dh)
Ke = zeros(n, n)
fe = zeros(n)
@lijas (Collaborator)

Previously we allocated the scratch data in a different way: fes = [zeros(n_basefuncs) for i in 1:nthreads] (to avoid cache misses, I believe). Is that not needed anymore? This is definitely cleaner.

@fredrikekre (Member Author)

This constructor will be called once per task, yes.

OhMyThreads.@tasks for cellidx in color
    @set scheduler = scheduler
    ## Obtain a task local scratch and unpack it
    @local scratch = ScratchData(dh, K, f, cellvalues_tmp)
@lijas (Collaborator)

Is the scratch data only created once per task? It does not look like it, since it is inside the loop, but maybe that is what the macro is for?
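
For context, a minimal toy sketch of the @local semantics (hypothetical example, not the howto's code): the initializer runs once per task, not once per iteration:

using OhMyThreads: @tasks, @local

@tasks for i in 1:1000
    ## `@local` hoists the initializer into task-local storage: `zeros(3)`
    ## runs once per task (not once per iteration), and each task then
    ## reuses its own buffer without sharing it with other tasks.
    @local buf = zeros(3)
    buf .= i
end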

@fredrikekre (Member Author)

On

Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 9354 32-Core Processor

with n = 40 I get

julia> data
8-element Vector{Pair{Int64, Float64}}:
   1 => 17.816107
   2 => 8.693293
   4 => 4.78457
   8 => 2.882439
  16 => 1.754894
  32 => 1.330953
  64 => 1.08486
 128 => 0.708531

julia> p = UnicodePlots.lineplot(first.(data), last.(data); xscale = :log2, yscale = :log2);

julia> UnicodePlots.lineplot!(p, first.(data), last(first(data)) ./ first.(data))
[UnicodePlots output: measured wall time vs. ideal linear scaling, log₂–log₂ axes, 1–128 threads]

@fredrikekre (Member Author)

With threadpinning (pinthreads(:cores)):

julia> data
8-element Vector{Pair{Int64, Float64}}:
   1 => 16.316607
   2 => 8.182938
   4 => 4.095896
   8 => 2.060625
  16 => 1.051968
  32 => 0.607133
  64 => 0.354834
 128 => 0.42452

julia> p = lineplot(first.(data), last.(data); xscale = :log2, yscale = :log2);

julia> lineplot!(p, first.(data), last(first(data)) ./ first.(data))
[UnicodePlots output: measured wall time vs. ideal linear scaling, log₂–log₂ axes, 1–128 threads]

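For comparison: with pinning, the best time is 0.355 s at 64 threads, roughly a 16.3 / 0.355 ≈ 46× speedup (about 72% parallel efficiency), whereas the unpinned run above reached only ≈25× at 128 threads. Note also the regression from 64 to 128 pinned threads (0.355 s → 0.425 s).
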
This patch rewrites the threaded assembly howto to use OhMyThreads.jl, which provides a better interface to multithreading than "raw" `@threads`. It also adds some more prose and explanations.
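
For illustration, a minimal hypothetical sketch of the change in style (toy code, not the howto's actual assembly loop):

using Base.Threads: @threads, threadid, nthreads
using OhMyThreads: @tasks, @set, @local

## "Raw" @threads: per-thread buffers indexed by threadid(), a pattern
## that is only safe with the :static scheduler and that ties the code
## to the number of threads.
buffers = [zeros(3) for _ in 1:nthreads()]
@threads :static for i in 1:100
    buffers[threadid()] .= i
end

## OhMyThreads: task-local buffers and an explicit scheduler choice.
@tasks for i in 1:100
    @set scheduler = :static
    @local buf = zeros(3)
    buf .= i
end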

Successfully merging this pull request may close these issues.

Update threading example
5 participants